{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/uptrain-ai/uptrain/blob/main/examples/benchmarks/claude_3_vs_gpt_4.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1 align=\"center\">\n",
    "  <a href=\"https://uptrain.ai\">\n",
    "    <img width=\"300\" src=\"https://user-images.githubusercontent.com/108270398/214240695-4f958b76-c993-4ddd-8de6-8668f4d0da84.png\" alt=\"uptrain\">\n",
    "  </a>\n",
    "</h1>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Claude 3 vs GPT-4\n",
    "Claude 3 was recently launched by Anthropic as a competitor to OpenAI's GPT-4. In this notebook, we will compare the two models to see if you should make the switch from GPT-4 to Claude 3.\n",
    "\n",
    "To do this comparison, we will use UpTrain's Response Matching operator. This operator takes two values, response and ground_truth, and returns a score between 0 and 1. The score is 1 if the response is very similar to the ground_truth and 0 if the response is completely different from the ground_truth.\n",
    "\n",
    "We have curated a dataset of 25 question-context pairs. For each question, we will get responses from both GPT-4 and Claude 3 Opus. We will take the response from GPT-4 as the ground_truth and compare the response from Claude 3 Opus against it using the Response Matching operator. We will then repeat the experiment with GPT-3.5-Turbo as the ground_truth and Claude 3 Sonnet as the response."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import the required libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "from uptrain import Settings\n",
    "from uptrain.operators import TextCompletion, JsonReader\n",
    "\n",
    "import os\n",
    "import polars as pl\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "shape: (25, 3)\n",
      "┌───────────────────────────────────┬───────────────────────────────────┬─────┐\n",
      "│ question                          ┆ context                           ┆ idx │\n",
      "│ ---                               ┆ ---                               ┆ --- │\n",
      "│ str                               ┆ str                               ┆ i64 │\n",
      "╞═══════════════════════════════════╪═══════════════════════════════════╪═════╡\n",
      "│ How to get a grip on finance?'    ┆ Try downloading a finance app li… ┆ 1   │\n",
      "│ How do “held” amounts appear on … ┆ \"The \"\"hold\"\" is just placeholde… ┆ 2   │\n",
      "│ Does negative P/E ratio mean sto… ┆ P/E is the number of years it wo… ┆ 3   │\n",
      "│ Should a retail trader choose a … ┆ \"That\\'s like a car dealer adver… ┆ 4   │\n",
      "│ Possibility to buy index funds a… ┆ \"As user quid states in his answ… ┆ 5   │\n",
      "│ …                                 ┆ …                                 ┆ …   │\n",
      "│ Discuss the role of inflation in… ┆ Inflation is a pervasive economi… ┆ 21  │\n",
      "│ Explain the concept of plate tec… ┆                                   ┆ 22  │\n",
      "│                                   ┆                                   ┆     │\n",
      "│                                   ┆ The Earth's dynamic and ever-c…   ┆     │\n",
      "│ How did the surrealist movement … ┆                                   ┆ 23  │\n",
      "│                                   ┆                                   ┆     │\n",
      "│                                   ┆                                   ┆     │\n",
      "│                                   ┆ The Surrealist movement, whic…    ┆     │\n",
      "│ Discuss the impact of globalizat… ┆                                   ┆ 24  │\n",
      "│                                   ┆                                   ┆     │\n",
      "│                                   ┆ Globalization, characterized b…   ┆     │\n",
      "│ What are the key differences bet… ┆                                   ┆ 25  │\n",
      "│                                   ┆ In the realm of infectious dise…  ┆     │\n",
      "└───────────────────────────────────┴───────────────────────────────────┴─────┘\n"
     ]
    }
   ],
   "source": [
    "url = \"https://uptrain-assets.s3.ap-south-1.amazonaws.com/data/uptrain_benchmark.jsonl\"\n",
    "dataset_path = os.path.join('./', \"uptrain_benchmark.jsonl\")\n",
    "\n",
    "if not os.path.exists(dataset_path):\n",
    "    import httpx\n",
    "    r = httpx.get(url)\n",
    "    r.raise_for_status()  # stop here rather than save an error page as the dataset\n",
    "    with open(dataset_path, \"wb\") as f:\n",
    "        f.write(r.content)\n",
    "\n",
    "dataset = pl.read_ndjson(dataset_path)\n",
    "print(dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Experiment 1: Claude 3 Opus vs GPT-4\n",
    "Now that we have the dataset, we can start the experiment. We will start by comparing Claude 3 Opus to GPT-4."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get responses from Claude 3 Opus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/25 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 25/25 [05:33<00:00, 13.32s/it]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (25, 5)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>question</th><th>context</th><th>idx</th><th>model</th><th>claude_3_opus_response</th></tr><tr><td>str</td><td>str</td><td>i64</td><td>str</td><td>str</td></tr></thead><tbody><tr><td>&quot;How to get a g…</td><td>&quot;Try downloadin…</td><td>1</td><td>&quot;claude-3-opus-…</td><td>&quot;Getting a grip…</td></tr><tr><td>&quot;How do “held” …</td><td>&quot;&quot;The &quot;&quot;hold&quot;&quot; …</td><td>2</td><td>&quot;claude-3-opus-…</td><td>&quot;When a credit …</td></tr><tr><td>&quot;Does negative …</td><td>&quot;P/E is the num…</td><td>3</td><td>&quot;claude-3-opus-…</td><td>&quot;A negative P/E…</td></tr><tr><td>&quot;Should a retai…</td><td>&quot;&quot;That\\&#x27;s like …</td><td>4</td><td>&quot;claude-3-opus-…</td><td>&quot;The decision t…</td></tr><tr><td>&quot;Possibility to…</td><td>&quot;&quot;As user quid …</td><td>5</td><td>&quot;claude-3-opus-…</td><td>&quot;Yes, it is pos…</td></tr><tr><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td></tr><tr><td>&quot;Discuss the ro…</td><td>&quot;Inflation is a…</td><td>21</td><td>&quot;claude-3-opus-…</td><td>&quot;Inflation is a…</td></tr><tr><td>&quot;Explain the co…</td><td>&quot;\n",
       "\n",
       "The Earth&#x27;s …</td><td>22</td><td>&quot;claude-3-opus-…</td><td>&quot;Plate tectonic…</td></tr><tr><td>&quot;How did the su…</td><td>&quot;\n",
       "\n",
       "\n",
       "The Surreal…</td><td>23</td><td>&quot;claude-3-opus-…</td><td>&quot;The Surrealist…</td></tr><tr><td>&quot;Discuss the im…</td><td>&quot;\n",
       "\n",
       "Globalizatio…</td><td>24</td><td>&quot;claude-3-opus-…</td><td>&quot;Globalization …</td></tr><tr><td>&quot;What are the k…</td><td>&quot;\n",
       "In the realm …</td><td>25</td><td>&quot;claude-3-opus-…</td><td>&quot;Viral and bact…</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (25, 5)\n",
       "┌───────────────────────┬──────────────────────┬─────┬──────────────────────┬──────────────────────┐\n",
       "│ question              ┆ context              ┆ idx ┆ model                ┆ claude_3_opus_respon │\n",
       "│ ---                   ┆ ---                  ┆ --- ┆ ---                  ┆ se                   │\n",
       "│ str                   ┆ str                  ┆ i64 ┆ str                  ┆ ---                  │\n",
       "│                       ┆                      ┆     ┆                      ┆ str                  │\n",
       "╞═══════════════════════╪══════════════════════╪═════╪══════════════════════╪══════════════════════╡\n",
       "│ How to get a grip on  ┆ Try downloading a    ┆ 1   ┆ claude-3-opus-202402 ┆ Getting a grip on    │\n",
       "│ finance?'             ┆ finance app li…      ┆     ┆ 29                   ┆ your finances …      │\n",
       "│ How do “held” amounts ┆ \"The \"\"hold\"\" is     ┆ 2   ┆ claude-3-opus-202402 ┆ When a credit card   │\n",
       "│ appear on …           ┆ just placeholde…     ┆     ┆ 29                   ┆ transaction i…       │\n",
       "│ Does negative P/E     ┆ P/E is the number of ┆ 3   ┆ claude-3-opus-202402 ┆ A negative P/E ratio │\n",
       "│ ratio mean sto…       ┆ years it wo…         ┆     ┆ 29                   ┆ does not ne…         │\n",
       "│ Should a retail       ┆ \"That\\'s like a car  ┆ 4   ┆ claude-3-opus-202402 ┆ The decision to      │\n",
       "│ trader choose a …     ┆ dealer adver…        ┆     ┆ 29                   ┆ choose a broker …    │\n",
       "│ Possibility to buy    ┆ \"As user quid states ┆ 5   ┆ claude-3-opus-202402 ┆ Yes, it is possible  │\n",
       "│ index funds a…        ┆ in his answ…         ┆     ┆ 29                   ┆ to buy both …        │\n",
       "│ …                     ┆ …                    ┆ …   ┆ …                    ┆ …                    │\n",
       "│ Discuss the role of   ┆ Inflation is a       ┆ 21  ┆ claude-3-opus-202402 ┆ Inflation is a       │\n",
       "│ inflation in…         ┆ pervasive economi…   ┆     ┆ 29                   ┆ sustained increas…   │\n",
       "│ Explain the concept   ┆                      ┆ 22  ┆ claude-3-opus-202402 ┆ Plate tectonics is a │\n",
       "│ of plate tec…         ┆                      ┆     ┆ 29                   ┆ scientific …         │\n",
       "│                       ┆ The Earth's dynamic  ┆     ┆                      ┆                      │\n",
       "│                       ┆ and ever-c…          ┆     ┆                      ┆                      │\n",
       "│ How did the           ┆                      ┆ 23  ┆ claude-3-opus-202402 ┆ The Surrealist       │\n",
       "│ surrealist movement … ┆                      ┆     ┆ 29                   ┆ movement, which b…   │\n",
       "│                       ┆                      ┆     ┆                      ┆                      │\n",
       "│                       ┆ The Surrealist       ┆     ┆                      ┆                      │\n",
       "│                       ┆ movement, whic…      ┆     ┆                      ┆                      │\n",
       "│ Discuss the impact of ┆                      ┆ 24  ┆ claude-3-opus-202402 ┆ Globalization has    │\n",
       "│ globalizat…           ┆                      ┆     ┆ 29                   ┆ had a profound…      │\n",
       "│                       ┆ Globalization,       ┆     ┆                      ┆                      │\n",
       "│                       ┆ characterized b…     ┆     ┆                      ┆                      │\n",
       "│ What are the key      ┆                      ┆ 25  ┆ claude-3-opus-202402 ┆ Viral and bacterial  │\n",
       "│ differences bet…      ┆ In the realm of      ┆     ┆ 29                   ┆ infections a…        │\n",
       "│                       ┆ infectious dise…     ┆     ┆                      ┆                      │\n",
       "└───────────────────────┴──────────────────────┴─────┴──────────────────────┴──────────────────────┘"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset_path=\"./uptrain_benchmark.jsonl\"\n",
    "claude_settings = Settings(model=\"claude-3-opus-20240229\", rpm_limit=4)\n",
    "dataset = JsonReader(fpath=dataset_path).setup(settings=claude_settings).run()[\"output\"]\n",
    "\n",
    "dataset = dataset.with_columns([pl.lit(\"claude-3-opus-20240229\").alias(\"model\")])\n",
    "dataset_with_claude_responses = TextCompletion(col_in_prompt=\"question\", col_out_completion=\"claude_3_opus_response\").setup(settings=claude_settings).run(dataset)[\"output\"]\n",
    "dataset_with_claude_responses"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get responses from GPT-4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/25 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 25/25 [00:32<00:00,  1.32s/it]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (25, 6)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>question</th><th>context</th><th>idx</th><th>model</th><th>claude_3_opus_response</th><th>gpt_4_response</th></tr><tr><td>str</td><td>str</td><td>i64</td><td>str</td><td>str</td><td>str</td></tr></thead><tbody><tr><td>&quot;How to get a g…</td><td>&quot;Try downloadin…</td><td>1</td><td>&quot;gpt-4&quot;</td><td>&quot;Getting a grip…</td><td>&quot;1. Take online…</td></tr><tr><td>&quot;How do “held” …</td><td>&quot;&quot;The &quot;&quot;hold&quot;&quot; …</td><td>2</td><td>&quot;gpt-4&quot;</td><td>&quot;When a credit …</td><td>&quot;&quot;Held&quot; amounts…</td></tr><tr><td>&quot;Does negative …</td><td>&quot;P/E is the num…</td><td>3</td><td>&quot;gpt-4&quot;</td><td>&quot;A negative P/E…</td><td>&quot;A negative P/E…</td></tr><tr><td>&quot;Should a retai…</td><td>&quot;&quot;That\\&#x27;s like …</td><td>4</td><td>&quot;gpt-4&quot;</td><td>&quot;The decision t…</td><td>&quot;Whether a reta…</td></tr><tr><td>&quot;Possibility to…</td><td>&quot;&quot;As user quid …</td><td>5</td><td>&quot;gpt-4&quot;</td><td>&quot;Yes, it is pos…</td><td>&quot;Yes, it is pos…</td></tr><tr><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td></tr><tr><td>&quot;Discuss the ro…</td><td>&quot;Inflation is a…</td><td>21</td><td>&quot;gpt-4&quot;</td><td>&quot;Inflation is a…</td><td>&quot;Inflation is a…</td></tr><tr><td>&quot;Explain the co…</td><td>&quot;\n",
       "\n",
       "The Earth&#x27;s …</td><td>22</td><td>&quot;gpt-4&quot;</td><td>&quot;Plate tectonic…</td><td>&quot;Plate tectonic…</td></tr><tr><td>&quot;How did the su…</td><td>&quot;\n",
       "\n",
       "\n",
       "The Surreal…</td><td>23</td><td>&quot;gpt-4&quot;</td><td>&quot;The Surrealist…</td><td>&quot;Surrealism tre…</td></tr><tr><td>&quot;Discuss the im…</td><td>&quot;\n",
       "\n",
       "Globalizatio…</td><td>24</td><td>&quot;gpt-4&quot;</td><td>&quot;Globalization …</td><td>&quot;Globalization …</td></tr><tr><td>&quot;What are the k…</td><td>&quot;\n",
       "In the realm …</td><td>25</td><td>&quot;gpt-4&quot;</td><td>&quot;Viral and bact…</td><td>&quot;Viral and bact…</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (25, 6)\n",
       "┌─────────────────────┬────────────────────┬─────┬───────┬────────────────────┬────────────────────┐\n",
       "│ question            ┆ context            ┆ idx ┆ model ┆ claude_3_opus_resp ┆ gpt_4_response     │\n",
       "│ ---                 ┆ ---                ┆ --- ┆ ---   ┆ onse               ┆ ---                │\n",
       "│ str                 ┆ str                ┆ i64 ┆ str   ┆ ---                ┆ str                │\n",
       "│                     ┆                    ┆     ┆       ┆ str                ┆                    │\n",
       "╞═════════════════════╪════════════════════╪═════╪═══════╪════════════════════╪════════════════════╡\n",
       "│ How to get a grip   ┆ Try downloading a  ┆ 1   ┆ gpt-4 ┆ Getting a grip on  ┆ 1. Take online     │\n",
       "│ on finance?'        ┆ finance app li…    ┆     ┆       ┆ your finances …    ┆ courses and Works… │\n",
       "│ How do “held”       ┆ \"The \"\"hold\"\" is   ┆ 2   ┆ gpt-4 ┆ When a credit card ┆ \"Held\" amounts,    │\n",
       "│ amounts appear on … ┆ just placeholde…   ┆     ┆       ┆ transaction i…     ┆ also known as \"p…  │\n",
       "│ Does negative P/E   ┆ P/E is the number  ┆ 3   ┆ gpt-4 ┆ A negative P/E     ┆ A negative P/E     │\n",
       "│ ratio mean sto…     ┆ of years it wo…    ┆     ┆       ┆ ratio does not ne… ┆ ratio doesn't nec… │\n",
       "│ Should a retail     ┆ \"That\\'s like a    ┆ 4   ┆ gpt-4 ┆ The decision to    ┆ Whether a retail   │\n",
       "│ trader choose a …   ┆ car dealer adver…  ┆     ┆       ┆ choose a broker …  ┆ trader should c…   │\n",
       "│ Possibility to buy  ┆ \"As user quid      ┆ 5   ┆ gpt-4 ┆ Yes, it is         ┆ Yes, it is         │\n",
       "│ index funds a…      ┆ states in his      ┆     ┆       ┆ possible to buy    ┆ possible for       │\n",
       "│                     ┆ answ…              ┆     ┆       ┆ both …             ┆ Canadian…          │\n",
       "│ …                   ┆ …                  ┆ …   ┆ …     ┆ …                  ┆ …                  │\n",
       "│ Discuss the role of ┆ Inflation is a     ┆ 21  ┆ gpt-4 ┆ Inflation is a     ┆ Inflation is a     │\n",
       "│ inflation in…       ┆ pervasive economi… ┆     ┆       ┆ sustained increas… ┆ vital element in … │\n",
       "│ Explain the concept ┆                    ┆ 22  ┆ gpt-4 ┆ Plate tectonics is ┆ Plate tectonics is │\n",
       "│ of plate tec…       ┆                    ┆     ┆       ┆ a scientific …     ┆ a theory expl…     │\n",
       "│                     ┆ The Earth's        ┆     ┆       ┆                    ┆                    │\n",
       "│                     ┆ dynamic and        ┆     ┆       ┆                    ┆                    │\n",
       "│                     ┆ ever-c…            ┆     ┆       ┆                    ┆                    │\n",
       "│ How did the         ┆                    ┆ 23  ┆ gpt-4 ┆ The Surrealist     ┆ Surrealism         │\n",
       "│ surrealist movement ┆                    ┆     ┆       ┆ movement, which b… ┆ tremendously       │\n",
       "│ …                   ┆                    ┆     ┆       ┆                    ┆ impacted…          │\n",
       "│                     ┆ The Surrealist     ┆     ┆       ┆                    ┆                    │\n",
       "│                     ┆ movement, whic…    ┆     ┆       ┆                    ┆                    │\n",
       "│ Discuss the impact  ┆                    ┆ 24  ┆ gpt-4 ┆ Globalization has  ┆ Globalization has  │\n",
       "│ of globalizat…      ┆                    ┆     ┆       ┆ had a profound…    ┆ significant im…    │\n",
       "│                     ┆ Globalization,     ┆     ┆       ┆                    ┆                    │\n",
       "│                     ┆ characterized b…   ┆     ┆       ┆                    ┆                    │\n",
       "│ What are the key    ┆                    ┆ 25  ┆ gpt-4 ┆ Viral and          ┆ Viral and          │\n",
       "│ differences bet…    ┆ In the realm of    ┆     ┆       ┆ bacterial          ┆ bacterial          │\n",
       "│                     ┆ infectious dise…   ┆     ┆       ┆ infections a…      ┆ infections a…      │\n",
       "└─────────────────────┴────────────────────┴─────┴───────┴────────────────────┴────────────────────┘"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gpt_settings = Settings(model=\"gpt-4\", rpm_limit=100)\n",
    "dataset = dataset_with_claude_responses.with_columns([pl.lit(\"gpt-4\").alias(\"model\")])\n",
    "experiment_dataset = TextCompletion(col_in_prompt=\"question\", col_out_completion=\"gpt_4_response\").setup(settings=gpt_settings).run(dataset)[\"output\"]\n",
    "experiment_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use the Response Matching operator to get the scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
   ],
   "source": [
    "from uptrain import EvalLLM, ResponseMatching\n",
    "\n",
    "settings = Settings(evaluate_locally=False)\n",
    "\n",
    "# Drop the \"context\" and \"model\" columns as they are not needed for the evaluation\n",
    "experiment_dataset = experiment_dataset.drop([\"context\", \"model\"])\n",
    "\n",
    "eval_llm = EvalLLM(settings=settings)\n",
    "results = eval_llm.evaluate(\n",
    "    data=experiment_dataset,\n",
    "    checks=[\n",
    "        ResponseMatching(\n",
    "            method=\"llm\",\n",
    "        )\n",
    "    ],\n",
    "    schema={\n",
    "        \"question\": \"question\",\n",
    "        \"response\": \"claude_3_opus_response\",\n",
    "        \"ground_truth\": \"gpt_4_response\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's analyze the results. First, we will take the mean of the scores to get an overall idea of how well Claude 3 Opus performs compared to GPT-4."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9274770685464783"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "avg_score = pl.DataFrame(results)[\"score_response_match\"].mean()\n",
    "avg_score"
   ]
  },
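  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a high score: on average, Claude 3 Opus's responses closely match GPT-4's, which suggests that Claude 3 Opus is a good competitor to GPT-4. However, the score measures similarity to GPT-4's response rather than quality, so we need to look at individual examples to see how the two models' responses actually compare."
   ]
  },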
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take one example and look at the response from each model along with the resulting score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question: How to get a grip on finance?'\n"
     ]
    }
   ],
   "source": [
    "row = results[0]\n",
    "print(\"Question:\", row[\"question\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GPT-4 Response:\n",
      "\n",
      "\n",
      "1. Take online courses and Workshops: Several online platforms such as Coursera, Udemy, and Khan Academy offer introductory courses on finance. \n",
      "\n",
      "2. Read Books: Reading books can give you a profound understanding of finance. Some popular books include: \"The Intelligent Investor\" by Benjamin Graham, \"Common Stocks and Uncommon Profits\" by Philip Fisher and \"Thinking, Fast and Slow\" by Daniel Kahneman.\n",
      "\n",
      "3. Attend Seminars: Attending seminars and workshops can provide first-hand knowledge as well as an opportunity to interact with industry professionals.\n",
      "\n",
      "4. Networking: Joining a local finance or investment club can provide opportunities for learning from others' experiences.\n",
      "\n",
      "5. Use Finance Apps: Personal finance apps such as Mint and PocketGuard can help keep track of individual income and spending and create budgets.\n",
      "\n",
      "6. Follow Finance Blogs and Websites: Websites such as Investopedia can provide helpful articles, glossaries, and tutorials on various finance topics.\n",
      "\n",
      "7. Watch finance-related documentaries and shows: They give a practical understanding of how financial markets work.\n",
      "\n",
      "8. Obtain relevant certifications: Various certifications such as CFA (Chartered Financial Analyst) or CFP (Certified Financial Planner) can provide a structured learning pathway and boost career opportunities.\n",
      "\n",
      "9. Set personal financial goals and work towards achieving them: Managing personal finances effectively is an integral part of understanding finance.\n",
      "\n",
      "10. Consult financial advisors: Lastly, for any advanced financial planning or investment strategies, it may be helpful to consult with a certified financial planner or advisor. \n",
      "\n",
      "Remember, learning finance is progressive - the more you explore, the more you understand.\n"
     ]
    }
   ],
   "source": [
    "print(\"GPT-4 Response:\\n\\n\")\n",
    "print(row[\"gpt_4_response\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "GPT-4 gave us a nice and detailed answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Claude 3 Opus Response:\n",
      "\n",
      "\n",
      "Getting a grip on your finances involves several steps:\n",
      "\n",
      "1. Track your income and expenses: Use a budgeting app or spreadsheet to monitor your cash flow. This will help you understand where your money is going and identify areas where you can cut back.\n",
      "\n",
      "2. Create a budget: Based on your income and expenses, create a realistic budget that allocates your money towards essential expenses, savings, and discretionary spending.\n",
      "\n",
      "3. Set financial goals: Establish short-term and long-term financial goals, such as paying off debt, saving for emergencies, or planning for retirement.\n",
      "\n",
      "4. Pay off high-interest debt: Prioritize paying off high-interest debt, like credit card balances, to reduce the amount of interest you pay over time.\n",
      "\n",
      "5. Build an emergency fund: Aim to save enough money to cover 3-6 months' worth of expenses in case of unexpected events like job loss or medical emergencies.\n",
      "\n",
      "6. Save and invest regularly: Automate your savings and investments to build wealth over time. Take advantage of employer-sponsored retirement plans, like 401(k)s, and consider opening an Individual Retirement Account (IRA).\n",
      "\n",
      "7. Educate yourself:\n"
     ]
    }
   ],
   "source": [
    "print(\"Claude 3 Opus Response:\\n\\n\")\n",
    "print(row[\"claude_3_opus_response\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So did Claude 3 Opus, although the output shown above ends abruptly at item 7, so the response appears to have been cut off. If we compare the two responses, we can see that Claude 3 Opus has given a response that is very similar to the response from GPT-4. Let's see the score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Response Matching Score: 0.9729729729729729\n"
     ]
    }
   ],
   "source": [
    "print(\"Response Matching Score:\", row[\"score_response_match\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The score is ~0.97, which aligns with our observation that the responses are very similar. On this dataset, Claude 3 Opus looks like a good alternative to GPT-4. Now let's do the same to compare Claude 3 Sonnet and GPT-3.5-Turbo."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Experiment 2: Claude 3 Sonnet vs GPT-3.5-Turbo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get responses from Claude 3 Sonnet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/25 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 25/25 [05:21<00:00, 12.86s/it]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (25, 5)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>question</th><th>context</th><th>idx</th><th>model</th><th>claude_3_sonnet_response</th></tr><tr><td>str</td><td>str</td><td>i64</td><td>str</td><td>str</td></tr></thead><tbody><tr><td>&quot;How to get a g…</td><td>&quot;Try downloadin…</td><td>1</td><td>&quot;claude-3-sonne…</td><td>&quot;Here are some …</td></tr><tr><td>&quot;How do “held” …</td><td>&quot;&quot;The &quot;&quot;hold&quot;&quot; …</td><td>2</td><td>&quot;claude-3-sonne…</td><td>&quot;On traditional…</td></tr><tr><td>&quot;Does negative …</td><td>&quot;P/E is the num…</td><td>3</td><td>&quot;claude-3-sonne…</td><td>&quot;A negative pri…</td></tr><tr><td>&quot;Should a retai…</td><td>&quot;&quot;That\\&#x27;s like …</td><td>4</td><td>&quot;claude-3-sonne…</td><td>&quot;The decision t…</td></tr><tr><td>&quot;Possibility to…</td><td>&quot;&quot;As user quid …</td><td>5</td><td>&quot;claude-3-sonne…</td><td>&quot;In Canada, you…</td></tr><tr><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td></tr><tr><td>&quot;Discuss the ro…</td><td>&quot;Inflation is a…</td><td>21</td><td>&quot;claude-3-sonne…</td><td>&quot;Inflation play…</td></tr><tr><td>&quot;Explain the co…</td><td>&quot;\n",
       "\n",
       "The Earth&#x27;s …</td><td>22</td><td>&quot;claude-3-sonne…</td><td>&quot;Plate tectonic…</td></tr><tr><td>&quot;How did the su…</td><td>&quot;\n",
       "\n",
       "\n",
       "The Surreal…</td><td>23</td><td>&quot;claude-3-sonne…</td><td>&quot;The surrealist…</td></tr><tr><td>&quot;Discuss the im…</td><td>&quot;\n",
       "\n",
       "Globalizatio…</td><td>24</td><td>&quot;claude-3-sonne…</td><td>&quot;Globalization …</td></tr><tr><td>&quot;What are the k…</td><td>&quot;\n",
       "In the realm …</td><td>25</td><td>&quot;claude-3-sonne…</td><td>&quot;Viral and bact…</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (25, 5)\n",
       "┌───────────────────────┬──────────────────────┬─────┬──────────────────────┬──────────────────────┐\n",
       "│ question              ┆ context              ┆ idx ┆ model                ┆ claude_3_sonnet_resp │\n",
       "│ ---                   ┆ ---                  ┆ --- ┆ ---                  ┆ onse                 │\n",
       "│ str                   ┆ str                  ┆ i64 ┆ str                  ┆ ---                  │\n",
       "│                       ┆                      ┆     ┆                      ┆ str                  │\n",
       "╞═══════════════════════╪══════════════════════╪═════╪══════════════════════╪══════════════════════╡\n",
       "│ How to get a grip on  ┆ Try downloading a    ┆ 1   ┆ claude-3-sonnet-2024 ┆ Here are some tips   │\n",
       "│ finance?'             ┆ finance app li…      ┆     ┆ 0229                 ┆ to help get a…       │\n",
       "│ How do “held” amounts ┆ \"The \"\"hold\"\" is     ┆ 2   ┆ claude-3-sonnet-2024 ┆ On traditional       │\n",
       "│ appear on …           ┆ just placeholde…     ┆     ┆ 0229                 ┆ credit card state…   │\n",
       "│ Does negative P/E     ┆ P/E is the number of ┆ 3   ┆ claude-3-sonnet-2024 ┆ A negative           │\n",
       "│ ratio mean sto…       ┆ years it wo…         ┆     ┆ 0229                 ┆ price-to-earnings    │\n",
       "│                       ┆                      ┆     ┆                      ┆ (P/…                 │\n",
       "│ Should a retail       ┆ \"That\\'s like a car  ┆ 4   ┆ claude-3-sonnet-2024 ┆ The decision to      │\n",
       "│ trader choose a …     ┆ dealer adver…        ┆     ┆ 0229                 ┆ choose a broker …    │\n",
       "│ Possibility to buy    ┆ \"As user quid states ┆ 5   ┆ claude-3-sonnet-2024 ┆ In Canada, you can   │\n",
       "│ index funds a…        ┆ in his answ…         ┆     ┆ 0229                 ┆ buy index fun…       │\n",
       "│ …                     ┆ …                    ┆ …   ┆ …                    ┆ …                    │\n",
       "│ Discuss the role of   ┆ Inflation is a       ┆ 21  ┆ claude-3-sonnet-2024 ┆ Inflation plays a    │\n",
       "│ inflation in…         ┆ pervasive economi…   ┆     ┆ 0229                 ┆ significant ro…      │\n",
       "│ Explain the concept   ┆                      ┆ 22  ┆ claude-3-sonnet-2024 ┆ Plate tectonics is a │\n",
       "│ of plate tec…         ┆                      ┆     ┆ 0229                 ┆ scientific …         │\n",
       "│                       ┆ The Earth's dynamic  ┆     ┆                      ┆                      │\n",
       "│                       ┆ and ever-c…          ┆     ┆                      ┆                      │\n",
       "│ How did the           ┆                      ┆ 23  ┆ claude-3-sonnet-2024 ┆ The surrealist       │\n",
       "│ surrealist movement … ┆                      ┆     ┆ 0229                 ┆ movement had a pr…   │\n",
       "│                       ┆                      ┆     ┆                      ┆                      │\n",
       "│                       ┆ The Surrealist       ┆     ┆                      ┆                      │\n",
       "│                       ┆ movement, whic…      ┆     ┆                      ┆                      │\n",
       "│ Discuss the impact of ┆                      ┆ 24  ┆ claude-3-sonnet-2024 ┆ Globalization has    │\n",
       "│ globalizat…           ┆                      ┆     ┆ 0229                 ┆ had a signific…      │\n",
       "│                       ┆ Globalization,       ┆     ┆                      ┆                      │\n",
       "│                       ┆ characterized b…     ┆     ┆                      ┆                      │\n",
       "│ What are the key      ┆                      ┆ 25  ┆ claude-3-sonnet-2024 ┆ Viral and bacterial  │\n",
       "│ differences bet…      ┆ In the realm of      ┆     ┆ 0229                 ┆ infections a…        │\n",
       "│                       ┆ infectious dise…     ┆     ┆                      ┆                      │\n",
       "└───────────────────────┴──────────────────────┴─────┴──────────────────────┴──────────────────────┘"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset_path=\"./uptrain_benchmark.jsonl\"\n",
    "claude_settings = Settings(model=\"claude-3-sonnet-20240229\", rpm_limit=4)\n",
    "dataset = JsonReader(fpath=dataset_path).setup(settings=claude_settings).run()[\"output\"]\n",
    "\n",
    "dataset = dataset.with_columns([pl.lit(\"claude-3-sonnet-20240229\").alias(\"model\")])\n",
    "dataset_with_claude_responses = TextCompletion(col_in_prompt=\"question\", col_out_completion=\"claude_3_sonnet_response\").setup(settings=claude_settings).run(dataset)[\"output\"]\n",
    "dataset_with_claude_responses"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get Responses from GPT-3.5-Turbo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|          | 0/25 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 25/25 [00:06<00:00,  3.76it/s]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (25, 6)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>question</th><th>context</th><th>idx</th><th>model</th><th>claude_3_sonnet_response</th><th>gpt_35_turbo_response</th></tr><tr><td>str</td><td>str</td><td>i64</td><td>str</td><td>str</td><td>str</td></tr></thead><tbody><tr><td>&quot;How to get a g…</td><td>&quot;Try downloadin…</td><td>1</td><td>&quot;gpt-3.5-turbo&quot;</td><td>&quot;Here are some …</td><td>&quot;1. Set financi…</td></tr><tr><td>&quot;How do “held” …</td><td>&quot;&quot;The &quot;&quot;hold&quot;&quot; …</td><td>2</td><td>&quot;gpt-3.5-turbo&quot;</td><td>&quot;On traditional…</td><td>&quot;&quot;Held&quot; amounts…</td></tr><tr><td>&quot;Does negative …</td><td>&quot;P/E is the num…</td><td>3</td><td>&quot;gpt-3.5-turbo&quot;</td><td>&quot;A negative pri…</td><td>&quot;A negative P/E…</td></tr><tr><td>&quot;Should a retai…</td><td>&quot;&quot;That\\&#x27;s like …</td><td>4</td><td>&quot;gpt-3.5-turbo&quot;</td><td>&quot;The decision t…</td><td>&quot;It ultimately …</td></tr><tr><td>&quot;Possibility to…</td><td>&quot;&quot;As user quid …</td><td>5</td><td>&quot;gpt-3.5-turbo&quot;</td><td>&quot;In Canada, you…</td><td>&quot;In Canada, it …</td></tr><tr><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td><td>&hellip;</td></tr><tr><td>&quot;Discuss the ro…</td><td>&quot;Inflation is a…</td><td>21</td><td>&quot;gpt-3.5-turbo&quot;</td><td>&quot;Inflation play…</td><td>&quot;Inflation is t…</td></tr><tr><td>&quot;Explain the co…</td><td>&quot;\n",
       "\n",
       "The Earth&#x27;s …</td><td>22</td><td>&quot;gpt-3.5-turbo&quot;</td><td>&quot;Plate tectonic…</td><td>&quot;Plate tectonic…</td></tr><tr><td>&quot;How did the su…</td><td>&quot;\n",
       "\n",
       "\n",
       "The Surreal…</td><td>23</td><td>&quot;gpt-3.5-turbo&quot;</td><td>&quot;The surrealist…</td><td>&quot;The surrealist…</td></tr><tr><td>&quot;Discuss the im…</td><td>&quot;\n",
       "\n",
       "Globalizatio…</td><td>24</td><td>&quot;gpt-3.5-turbo&quot;</td><td>&quot;Globalization …</td><td>&quot;Globalization …</td></tr><tr><td>&quot;What are the k…</td><td>&quot;\n",
       "In the realm …</td><td>25</td><td>&quot;gpt-3.5-turbo&quot;</td><td>&quot;Viral and bact…</td><td>&quot;One of the key…</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (25, 6)\n",
       "┌───────────────────┬──────────────────┬─────┬───────────────┬──────────────────┬──────────────────┐\n",
       "│ question          ┆ context          ┆ idx ┆ model         ┆ claude_3_sonnet_ ┆ gpt_35_turbo_res │\n",
       "│ ---               ┆ ---              ┆ --- ┆ ---           ┆ response         ┆ ponse            │\n",
       "│ str               ┆ str              ┆ i64 ┆ str           ┆ ---              ┆ ---              │\n",
       "│                   ┆                  ┆     ┆               ┆ str              ┆ str              │\n",
       "╞═══════════════════╪══════════════════╪═════╪═══════════════╪══════════════════╪══════════════════╡\n",
       "│ How to get a grip ┆ Try downloading  ┆ 1   ┆ gpt-3.5-turbo ┆ Here are some    ┆ 1. Set financial │\n",
       "│ on finance?'      ┆ a finance app    ┆     ┆               ┆ tips to help get ┆ goals: Write do… │\n",
       "│                   ┆ li…              ┆     ┆               ┆ a…               ┆                  │\n",
       "│ How do “held”     ┆ \"The \"\"hold\"\" is ┆ 2   ┆ gpt-3.5-turbo ┆ On traditional   ┆ \"Held\" amounts   │\n",
       "│ amounts appear on ┆ just placeholde… ┆     ┆               ┆ credit card      ┆ typically appear │\n",
       "│ …                 ┆                  ┆     ┆               ┆ state…           ┆ …                │\n",
       "│ Does negative P/E ┆ P/E is the       ┆ 3   ┆ gpt-3.5-turbo ┆ A negative price ┆ A negative P/E   │\n",
       "│ ratio mean sto…   ┆ number of years  ┆     ┆               ┆ -to-earnings     ┆ ratio typically  │\n",
       "│                   ┆ it wo…           ┆     ┆               ┆ (P/…             ┆ i…               │\n",
       "│ Should a retail   ┆ \"That\\'s like a  ┆ 4   ┆ gpt-3.5-turbo ┆ The decision to  ┆ It ultimately    │\n",
       "│ trader choose a … ┆ car dealer       ┆     ┆               ┆ choose a broker  ┆ depends on the   │\n",
       "│                   ┆ adver…           ┆     ┆               ┆ …                ┆ tra…             │\n",
       "│ Possibility to    ┆ \"As user quid    ┆ 5   ┆ gpt-3.5-turbo ┆ In Canada, you   ┆ In Canada, it is │\n",
       "│ buy index funds   ┆ states in his    ┆     ┆               ┆ can buy index    ┆ possible to buy… │\n",
       "│ a…                ┆ answ…            ┆     ┆               ┆ fun…             ┆                  │\n",
       "│ …                 ┆ …                ┆ …   ┆ …             ┆ …                ┆ …                │\n",
       "│ Discuss the role  ┆ Inflation is a   ┆ 21  ┆ gpt-3.5-turbo ┆ Inflation plays  ┆ Inflation is the │\n",
       "│ of inflation in…  ┆ pervasive        ┆     ┆               ┆ a significant    ┆ rate at which t… │\n",
       "│                   ┆ economi…         ┆     ┆               ┆ ro…              ┆                  │\n",
       "│ Explain the       ┆                  ┆ 22  ┆ gpt-3.5-turbo ┆ Plate tectonics  ┆ Plate tectonics  │\n",
       "│ concept of plate  ┆                  ┆     ┆               ┆ is a scientific  ┆ is a scientific  │\n",
       "│ tec…              ┆ The Earth's      ┆     ┆               ┆ …                ┆ …                │\n",
       "│                   ┆ dynamic and      ┆     ┆               ┆                  ┆                  │\n",
       "│                   ┆ ever-c…          ┆     ┆               ┆                  ┆                  │\n",
       "│ How did the       ┆                  ┆ 23  ┆ gpt-3.5-turbo ┆ The surrealist   ┆ The surrealist   │\n",
       "│ surrealist        ┆                  ┆     ┆               ┆ movement had a   ┆ movement had a   │\n",
       "│ movement …        ┆                  ┆     ┆               ┆ pr…              ┆ si…              │\n",
       "│                   ┆ The Surrealist   ┆     ┆               ┆                  ┆                  │\n",
       "│                   ┆ movement, whic…  ┆     ┆               ┆                  ┆                  │\n",
       "│ Discuss the       ┆                  ┆ 24  ┆ gpt-3.5-turbo ┆ Globalization    ┆ Globalization    │\n",
       "│ impact of         ┆                  ┆     ┆               ┆ has had a        ┆ has had a        │\n",
       "│ globalizat…       ┆ Globalization,   ┆     ┆               ┆ signific…        ┆ signific…        │\n",
       "│                   ┆ characterized b… ┆     ┆               ┆                  ┆                  │\n",
       "│ What are the key  ┆                  ┆ 25  ┆ gpt-3.5-turbo ┆ Viral and        ┆ One of the key   │\n",
       "│ differences bet…  ┆ In the realm of  ┆     ┆               ┆ bacterial        ┆ differences      │\n",
       "│                   ┆ infectious dise… ┆     ┆               ┆ infections a…    ┆ betwe…           │\n",
       "└───────────────────┴──────────────────┴─────┴───────────────┴──────────────────┴──────────────────┘"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gpt_settings = Settings(model=\"gpt-3.5-turbo\", rpm_limit=100)\n",
    "dataset = dataset_with_claude_responses.with_columns([pl.lit(\"gpt-3.5-turbo\").alias(\"model\")])\n",
    "experiment_dataset = TextCompletion(col_in_prompt=\"question\", col_out_completion=\"gpt_35_turbo_response\").setup(settings=gpt_settings).run(dataset)[\"output\"]\n",
    "experiment_dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use the Response Matching operator to get the scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m2024-03-07 10:49:58.773\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate_on_server\u001b[0m:\u001b[36m341\u001b[0m - \u001b[1mSending evaluation request for rows 0 to <50 to the Uptrain\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[32m2024-03-07 10:50:24.652\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[36muptrain.framework.evalllm\u001b[0m:\u001b[36mevaluate\u001b[0m:\u001b[36m330\u001b[0m - \u001b[1mServer is not running!\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "from uptrain import EvalLLM, ResponseMatching\n",
    "\n",
    "settings = Settings(evaluate_locally=False)\n",
    "\n",
    "# Drop the \"context\" and \"model\" columns: Response Matching only uses the question and the two responses\n",
    "experiment_dataset = experiment_dataset.drop([\"context\", \"model\"])\n",
    "\n",
    "eval_llm = EvalLLM(settings=settings)\n",
    "results = eval_llm.evaluate(\n",
    "    data=experiment_dataset,\n",
    "    checks=[\n",
    "        ResponseMatching(\n",
    "            method=\"llm\",\n",
    "        )\n",
    "    ],\n",
    "    schema={\n",
    "        \"question\": \"question\",\n",
    "        \"response\": \"claude_3_sonnet_response\",\n",
    "        \"ground_truth\": \"gpt_35_turbo_response\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a different example and look at both responses along with their Response Matching score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question: How do “held” amounts appear on statements and affect balances of traditional credit cards?'\n"
     ]
    }
   ],
   "source": [
    "row = results[1]\n",
    "print(\"Question:\", row[\"question\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GPT-3.5-Turbo Response:\n",
      "\n",
      "\n",
      "\"Held\" amounts typically appear on credit card statements as pending charges or authorizations. These are temporary holds placed on the cardholder's account for a certain amount of money, such as when making a hotel reservation or renting a car. The held amount is not deducted from the available balance immediately but may affect the overall available credit on the card.\n",
      "\n",
      "For traditional credit cards, these held amounts do not impact the current balance that is due for payment. However, they can affect the credit available to the cardholder if the held amount is close to or equal to the available credit limit. This can potentially limit the cardholder's ability to make additional purchases until the held amount is no longer pending.\n",
      "\n",
      "It is important for cardholders to keep track of held amounts and understand how they can impact their available credit and spending ability. Held amounts will eventually be released and the actual charge will be posted to the account, at which point it will be reflected in the card balance.\n"
     ]
    }
   ],
   "source": [
    "print(\"GPT-3.5-Turbo Response:\\n\\n\")\n",
    "print(row[\"gpt_35_turbo_response\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The response from GPT-3.5-Turbo is very detailed and informative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Claude 3 Sonnet Response:\n",
      "\n",
      "\n",
      "On traditional credit card statements, any \"held\" amounts are typically shown separately from the current balance owed. Here's how they are displayed and affect balances:\n",
      "\n",
      "1. Current Balance: This is the total amount you owe on your credit card as of the statement date. It includes all new charges, fees, interest charges, and any remaining balance from the previous statement that wasn't paid in full.\n",
      "\n",
      "2. Held Amounts/Pending Transactions: Many credit card issuers will display a separate section or line item for \"held\" or \"pending\" amounts. These are transactions that have been authorized but not yet posted or settled to your account.\n",
      "\n",
      "3. Available Credit: Your available credit is your total credit limit minus your current balance and any held amounts. The held amounts temporarily reduce your available credit even though they haven't been added to the current balance yet.\n",
      "\n",
      "4. Impact on Balance: Held amounts do not directly affect your current balance on the statement. However, once those pending transactions settle and post, they will be added to your next statement's balance.\n",
      "\n",
      "It's important to note that held amounts are temporary and usually drop off after a few days once the final transaction amount clears. This helps ensure you have enough\n"
     ]
    }
   ],
   "source": [
    "print(\"Claude 3 Sonnet Response:\\n\\n\")\n",
    "print(row[\"claude_3_sonnet_response\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The response from Claude 3 Sonnet is also detailed and informative, although the printed output ends mid-sentence. Comparing the two responses, the information is largely the same but the style differs: GPT-3.5-Turbo answers in prose, while Claude 3 Sonnet uses a numbered list. Let's see the score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Response Matching Score: 0.9411764706\n"
     ]
    }
   ],
   "source": [
    "print(\"Response Matching Score:\", row[\"score_response_match\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The score is ~0.94, which aligns with our observation that the two responses are very similar. On this example, Claude 3 Sonnet looks like a good alternative to GPT-3.5-Turbo."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
